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Abstract 

Density mode clustering is a nonparametric clustering method. The clusters are the basins 
of attraction of the modes of a density estimator. We study the risk of mode-based clus¬ 
tering. We show that the clustering risk over the cluster cores — the regions where the 
density is high — is very small even in high dimensions. And under a low noise condition, 
the overall cluster risk is small even beyond the cores, in high dimensions. 

Keywords: Clustering, Density Estimation, Morse Theory 


1. Introduction 

Density mode clustering is a nonparametric method for using density estimation to find 


clusters 

(Cheng 1995 

Comaniciu and Meer 

2002, 

Arias-Castro et al., 

2013 1 

Chacon and 

Duong 

2013 

). The basic idea is to estimate the modes of the density, and then assign points 


to the modes by finding the basins of attraction of the modes. See Figures |T] and [5| 

In this paper we study the risk of density mode clustering. We define the risk in terms 
of how pairs of points are clustered under the true density versus the estimated density. We 
show that the cluster risk over the cluster cores — the high density portion of the basins 
- is exponentially small, independently of dimension. Moreover, if a certain low noise 
assumption holds then the cluster risk outside the cluster cores is small. The low noise 
assumption is similar in spirit to the Tsyabakov low noise condition that often appears in 
the high dimensional classification literature (Audibert and Tsybakov.-, 2007). 

It is worth expanding on this last point. Because mode clustering requires density 
estimation — and because density estimation is difficult in high dimensions — one might 
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get the impression that mode clustering will not work well in high-dimensions. But we show 
that this is not the case. Even in high dimensions the clustering risk can be very small. 
Again, the situation is analogous to classification: poor estimates of the regression function 
can still lead to accurate classifiers. 

There are many different types of clustering — &:-means, spectral, convex, hierarchical 
— and we are not claiming that mode clustering is necessarily superior to other clustering 
methods. Indeed, which method is best is very problem specific. Rather, our goal is simply 
to find bounds on the performance of mode base clustering. Our analysis covers both the 
low and high-dimensional cases. 

Outline. In Section 2 we review mode clustering. In Section 3 we discuss the estimation 
of the clusters using kernel density estimators. Section 4 contains the main results. After 
some preliminaries, we bound the risk over the cluster cores in Section 4.3. In Section 4.4 we 
bound the risk outside the cores under a low noise assumption. In Section 4.5 we consider 
the case of Gaussian clusters. In Section 4.6 we show a different method to bound the risk 
in the low dimensional case. Section 5 contains some numerical experiments. We conclude 
with a discussion in Section 6. 


Related Work. Mode clustering is usually implemented using the mean-shift algorithm 


Fukunaga and Hostetler ( 

1975); 

Cheng ( 

1995); 


(2002). The algorithm is analyzed in Arias-Castro et al. (2013). Li et al. (2007); Azzalini 


and Torelli (2007) introduced mode clustering to the statistics literature. The related idea 


of clustering based on high density regions was proposed in Hartigan (1975). Chacon et al. 


(2011) and Chacon and Duong (2013) propose several methods for selecting the bandwidth 
for estimating the derivatives of the density estimator which can in turn be used as a 
bandwidth selection rule for mode clustering. A method that is related to mode clustering 


is clustering based on trees constructed from density level sets. See, for example, Chaudhuri 


and Dasgupta (2010), Kpotufe and von Luxburg (2011) and Kent et al. (2013). 


Notation: We let p denote a density function, g its gradient and H its Hessian. A point 
x is a local mode (i.e. a local maximum) of p if ||g(;r)|| = 0 and all the eigenvalues of 
H(x ) are negative. Here, || ■ || denotes the usual L 2 norm. In general, the eigenvalues of a 
symmetric matrix A are denoted by Ai > A 2 > ■ • •. We write a n A b n to mean that there 
is some C > 0 such that a n < Cb n for all large n. We use B(x, e) to denote a closed ball of 
radius e centered at x. The boundary of a set A is denoted by dA. 


2. Mode Clustering and Morse Theory 

Here we give a brief review of mode clustering, also called mean-shift clustering; more details 
can be found in Cheng (1995), Comaniciu and Meer (2002), Arias-Castro, Mason, Pelletier 
(2014) and Chacon (2012). 
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Figure 1: Left: a simple dataset. Middle: the kernel density estimator. Right: The four 
estimated modes and their basins of attractions. 


2.1 Morse Theory 


We will need some terminology from Morse theory. Good references on Morse theory include 


Edelsbrunner and Harer 

(2010) 

Milnor 

(1963) 

Matsumoto 

(2002) 

Banyaga and Hurtubise 


(2004). 


Let p be a bounded continuous density on with gradient g and Hessian H. A point 
x is a critical point if ||y(x)|| = 0. We then call p{x) a critical value. A point that is not a 
critical point is a regular point. 

The function p is a Morse function if all its critical values are non-degenerate (i.e. 
the Hessian at each critical point is non-singular). A critical point x is a mode, or local 
maximum, if the Hessian H(x) is negative definite at x. The index of a critical point x 
is the number of negative eigenvalues of H{x). Critical points are maxima, minima or 
saddlepoints. 

The flow starting at x is the path tt x : M —> satisfying tt x (0) = x and 

= Vp(7r x (t)). (1) 

The flow 7 v x (t) defines the direction of steepest ascent at x. The destination and origin of 
the flow n x are defined by 

dest(x) = lim ir x (t), org(x) = lim 7p(x). (2) 

t—>oo t— >—oo 

If x is a critical point, then dest(x) = x. 

The stable manifold corresponding to a critical point y — also called the descending 
manifold or the basin of attraction — is 

C{y) = jx : dest(x) = yj. (3) 

In particular, the basin of attraction of a mode m is called a cluster. See Figures [2] and [3j 
Let us mention a few properties of Morse functions that are useful: 
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Figure 2: A Morse function with four modes. Each solid blue dot is a mode. Each red dot 
is a minimum. Pink dots denote saddle points. The green area is the descending 
manifold (cluster) for one of the modes. 



Figure 3: The three large black dots are the three local modes that induce three clusters based 
on the corresponding basins of attraction. The cluster boundaries, D, consists of 
the local minima (the square box, Do) and the three thick smooth curves are D\. 
The circles on the boundaries are saddle points. The dotted lines show the flow 
lines. 
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1. Excluding critical points, two flow lines are either disjoint or they are the same. 

2. The origin and destination of a flow line are critical points (except at boundaries of 
clusters). The set of points x whose destinations are not modes are on the boundaries 
of clusters and form a set of measure 0. 

3. Flow lines are monotonic: p(xt) is a non-decreasing function of t, where xt = 7f x (t). 
Further, p( dest(x)) > p(org(x)) and dest(x) / org(x) if x is a regular point. 

4. The index of dest(x) is greater than the index of org(x). 

5. The flow has the semi-group property: 4>(x,t + s) = 4>(4>(x, t ), s ) where <j){x, t) = Tr x (t). 

6 . Let C be the basin of attraction of a mode m. If y is a critical point in the closure of 
C and y ^ m, then y E dC. 


2.2 Clusters 

Consider a distribution P on K, C with density p. We assume that p is a Morse function 
with finitely many critical points. The modes of p are denoted by 


M = {mi,.. .,m k } 

The corresponding clusters are C\,...,C k where Cj = : dest(x) = rrij j. 

clustering function c:/Cx/C—>■{ 0,1} by 


( 4 ) 

Define the 


c{x, y) = 


1 if dest(x) = dest(y) 
0 if dest(x) / dest(y). 


Thus, c(x, y) = 1 if and only if x and y are in the same cluster. 

Let X \,..., X n G be random vectors drawn iid from P. Let p be an estimate of 
the density p with corresponding estimated modes M. = {mi,..., me}, and basins C = 
{C\,..., Ci}. This defines a cluster function c. 

In this paper, the pairwise clustering loss is defined to be 


L = i E J (^ X j, X k) + c(X v X k 
^ 2 ) j<k 


( 5 ) 


which is one minus the Rand index. The corresponding clustering risk is R = E[L]. 


3. Estimated Clusters 


Estimating the clusters involves two steps. First we estimate the density then we estimate 
the modes and their basins of attractions. To estimate the density we use the standard 
kernel density estimator 


Ph{x) 


1 

n 



i=i 


(tM). 


( 6 ) 


We will need the following result on the accuracy of derivative estimation. We state the 
result without proof as it is a simple generalization of the result in Gine and Guillou (2002) 
which is based on Talagrand’s inequality. In fact, it is essentially a different way of stating 
the results of Lemmas 2 and 3 in Arias-Castro et al. (2013). 
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Mean Shift 

1. Choose a set of grid points G = {g \,..., Usually, these are taken to be the data 
points. 

2. For each g G G, iterate until convergence: 

(r+i), ZiXimg {r) -Xj\\/h) 

Eim^ r) -Xi\\/h) ■ 

3. Let M. be the unique elements of {g[°°\ ■ ■ ■ ,g Output ... , M. and 

dest(^) = gf°\ 

Figure 4: The Mean Shift Algorithm 


Lemma 1 Let Ph{x) = E[p) l (x)]. Assume that the kernel is Gaussian. Also assume that p 
has bounded continuous derivatives up to and including third order. Then: 

(1: Bias) There exist Co, ci, C 2 such that 

sup\p h (x)-p(x)\ < c 0 h 2 , sup||Vp h (x)-Vp(x)|| < Cih 2 , sup \\V 2 p h (x) - V 2 p(z)|| < c 2 h. 

XX X 

(2: Variance) There exist b,bo,bi,b 2 such that, if (log n/ri) 1 / d < h < b where b < 1, 
then, 

P(sup \p h (x) - p h (x) | > e) < e ~ b ° nhd ^ 

X 

P(sup||Vp/ i ,(x) - Vp h (x)\\ > e) < e -Anh d + 2 e 2 

X 

P(sup \\v 2 p h {x) - V 2 p h .{x) || > e) < e ~ b ^ nhd+4e \ 


Remark: It is not necessary to use a Gaussian kernel. Any kernel that satisfies the 


conditions in Arias-Castro et al. (2013) will do. 


To find the modes of ph we use the well-known mean shift algorithm. See Figures [4] and 
§ The algorithm approximates the flow defined by 0. The algorithm finds the modes, the 
basins of attractions and the destination dest(.x) of any point x. A rigorous analysis of the 


algorithm can be found in Arias-Castro et al. (2013) 


4. Bounding the Risk 

We are now ready to bound the clustering risk. We begin by introducing some preliminary 
concepts. 
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Figure 5: An illustration of the mean shift algorithm. The data are moved to the two modes 
along their gradient ascent paths. 


4.1 Stability 

To bound the clustering risk, we need to control how much the critical points can change 
when the density is perturbed. In particular, we need the following result which is Lemma 
16 from Chazal et al (2015). 

Lemma 2 Let p be a density with compact support. Assume that p is a Morse function 
with finitely many critical values C = {c\,... ,cl} and that p has two continuous derivatives 
on the interior of its support and non-vanishing gradient on the boundary of its support. 
Let q be another density and let rj = maxjryo, r)i, 772 } where 

Vo = sup|p(.x) - q{x) I, pi = sup ||Vp(x) - Vg(x)||, ?y 2 = sup ||V 2 p(x) - \7 2 q(x)\\ 

XX X 

where V 2 is the vec of the Hessian. There are constants k = n(p) and A = A(p) such 
that, if r] < k then the following is true. The function q is Morse and has L critical points 
C' = {c\ ,..., c' L }. After a suitable relabeling of the indices, cj and c'- have the same Morse 
index for all j and max^ ||Cj — c'-|| < A(p)r], 

4.2 The Cluster Cores 

An important part of our analysis involves, what we refer to as, the cluster cores. These are 
the high density regions inside each cluster. Consider the clusters C = {C\,... ,0^}- Define 

£7 = sup p(x) (7) 

xGdCj 

where dCj is the boundary of Cj. For any a > 0 we define the j th cluster core by 

Cj(a) = jx G Cj : p(x) > £j + aj. (8) 

See Figure [6j 
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Theorem 3 Let p be a density function with compact support. Assume that p is a Morse 
function with finitely many critical values and that C g = sup^, ||g(x)|| < oo where g is the 
gradient of p. Let p be another density and define y, r/o, y\ , 772 , A(p) and n(p) as in Lemma 
^ Let 7 r denote the paths defined by p. Let C be a cluster of p with mode rn and let 
f = sup X £dcP( x ). Let a = C g Ay + 2yo. Assume that y < n(p) and that 

p(m) > a + ArjCg + £ = 2 AyC g + 2y 0 + £. (9) 


Then the following hold: 

1. //xGC^(a) then d(x,dC) > -<j- where d(x, A) = inf y£ a \\x — y\\. 

2. B(m, Ay) <Z C\a — 2yo). 

3. p has a mode fh G {a — 2yo). 

4■ p has no other critical points in (a — 2yo). 

5. Let x G C\a). Then Tr x (t) G C^(a — 2yo) for all t > 0. 

6. Let x,y G C^(a). Then dest(x) = dest(y) = m and dest(x) = dest(y) = fh. Hence, 
c(x,y) = c(x,y). 


Proof 

1. Let z be the projection of x onto dC. (Choose any projection if it is not unique.) 
Using an exact Taylor expansion, 


£ + a < p(x) = p{z) + (x 

< p(z) + Cg\\x - z\ 

< £ + C g d(x , dC). 


z) T / g(z + u(x — 

Jo 

p(z) + C g d(x, dC ) 


z)) du 


2. Let x G B(m,Arf). Then 


p(x) = p(m ) + (x — m) T / g(m + u(x — m))du > p(m) — ||ic — m\\C g > p{m) — AyC g > a + £ 


and hence x G (a) cd(a- 2r/o). 

3. By Lemma [2j p has a mode m such that \\m — m\\ < Ay. The result then follows 
from part 2. 

4. Let c be a critical point of p different from fh. By Lemma |2j there is a critical point 
c of p such that ||c — c|| < Ay. Now c must be on the boundary of some cluster or must be 
a minimum. Either way, it is not in the interior of C. Let r be the point on dC closest to 
c. Then d(c,C^(a — 2yo)) > d(r,C^(a — 2yo)). By part 1, d(r,C^(a — 2yo)) > (a — 2t/q) /C g . 
Thus, d(c,C^(a — 2yo)) > (a — 2yo)/C g — Ay. By the definition of a, it follows then that 
d(c,C^(a — 2?7o)) > 0 and hence c ^ C^(a — 2yo). 

5. Let x G C^{a). Then, for any t > 0, 


p(nx(t)) > p(nx(t)) ~Vo > p(n x { 0)) - Vo 

= p{x) -y 0 > p(x) -2y 0 > £, + a - 2y 0 . 
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6. Let x,y E C\a). Trivially, we have that dest(x') = dest(y) = m. From the previous 
result, dest(x) E C^{a — 2 t]q). From parts 3 and 4, the only critical point of p in C^(a — 2r/o) 
is fh. Similarly for y. Hence, dest(x) = dest(y) = fh. ■ 


4.3 Bounding the Risk Over the Cores 

Now we bound the risk for the data points that are in the cluster cores. 

Theorem 4 Assume that p is a Morse function with finitely many critical values. Denote 
the modes and clusters by m\, ..., rn^ and C\ ,..., . Let fih be the kernel density estimator. 

Let 7] = max{?7o, pi, 772 } where 

Vo = sup | p h (x) -p(x)\, 771 = sup 11Vp/,0) - Vp(x)||, 772 = sup | \V 2 ph(x) - V 2 p(x)||. 

XX X 

Let a = CgArj + 2rjo and let C t Cj(a) and let X = {Xi : X t E C^(a)} be the points in 
the cores. Let (j = sup{p(x) : x E dCj}. 

1 ■ If 

p(mj) > 2Ar]C g + 2r / 0 + £7 

for each j, then c(Xj,Xj) = c(Xj, Xj) for every Xi,Xj E X . 

2. If h n —> 0 and —> 00 , then 

P (d(Xi,Xj) ficiXi^Xj) for any Xi,Xj E x\ < e~ nb (10) 

for some b > 0 (independent of d). 

Remark: Note that 77 , ?7o, 771,772 are functions of n but we suppress the dependence for 
simplicity. 

Proof 1. From Lemma [lj we have that P(t? > n(p)) is exponentially small. Hence, 
Lemma [ 2 ] applies. If p{mj ) > 2Ar]C g + 2r/o + £ for all j , then Theorem [ 3 ] implies that 
c(X l ,X j ) = c(Xi, Xj) for every X u Xj E X. 

2. We need to show that p(jnj) > 2Ar]C g + 2rjo + for all j so we can apply part 1. 

The probability that p(rrij ) > 2ArjC g + 2 ? 7 o + £7 fails for some j , is > q ) where q > 0 
is a constant. If h n —> 0 and nhffi 4 —> 00 , then from Lemma [lj P(r 7 > q) is exponentially 
small: 

2 

P(t? > q) <^2^(vj > Q ) 

j =0 

< exp (^-bonh^q - c 0 /i 2 ) 2 ) + exp (-binh^ 2 (q - cih 2 n ) 2 ) + exp (^-b 2 nh^ +4: (q - c 2 h n ) 2 

< e~ nb 

for some b > 0. ■ 
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4.4 Beyond the Cores 

Now we bound the risk beyond the cores. Furthermore, we explicitly let d = d n increase 
with n. This means that the distribution also changes with n so we sometimes write p as 
Pn- 

Theorem [4] shows that the risk over the cores where p(x) > £ + a is exponentially small 
as long as we take a = Cr/ for some C > 0. The total risk is therefore the exponential 
bound plus the probability that a point fails to satisfy p(x) > £ + a. Formally: 

Corollary 5 Assume the conditions of Theorem [/[ The cluster risk is bounded by 

P(p(X) <i + Cp) + e~ nb . (11) 


Note that, in the corollary, it is not necessary to let h —> 0. To further control the 
risk beyond the cores, we need to make sure that P(p(X) < £ + Crj) is small. To do this, 
especially in the high-dimensional case, we need to assume that the clusters are well-defined 
and are well-separated. We call these assumptions “low noise” assumptions since they are 
similar in spirit to the Tsybakov low noise assumption that is often used in high-dimensional 
classification (Audibert and Tsybakov, 2007). Specifically, we assume that following: 

(Low Noise Assumptions:) 


1. Let u n be the minimal distance between critical points of p n . We assume that cr = 
lim inf n a n > 0 . 

2. Let m n be the number of modes of p n . Then limsup, woo m n < oo. 

3. hmn—^oc miiij p n incij ) > 0. 

4. in < for some 7 > 0 where = sup xg £)p n (x) and D = (J ■ dCj. 

5. For all small e, P(p n (X) < e) < e 13 where P = fid is increasing with d. 


Parts 1-3 capture the idea that the clusters are well-defined. It is really parts 4 and 5 
that capture the low noise idea. In particular, part 4 says that the density at the cluster 
boundaries is small. (See Figure|hj) Part 5 rules out thick tails. Note that for a multivariate 
Normal N(0,a 2 I ), we have that, for any fixed small e > 0, P(p(X) < e) < e~ d when a is 
not too large. So part 5 automatically holds for distributions with Gaussian-like tails. 


Theorem 6 Assume that p n is Morse and that the low noise conditions hold. Assume that 
p n has three bounded continuous derivatives . Let h n x n -1 /( 5+d ). Then the clustering risk 
R satisfies 


R A 





+ e~ nb . 


In particular , R = 0(^/logn/n) when (3d > max{(d + 5)/2, 1 /( 27 )}. 


Proof For points in the core, the risk is controlled by Theorem [4j We need now bound 
the number of pairs outside the cores. For this, it suffices to bound 


P (p n {X) <in + CgArj + 2770 ). 
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Figure 6: Left: When clusters are not well separated, f is large. In this case, the mass 
inside the cluster but outside the core can be large. Right: When clusters are well 
separated, f is small. The blue lines correspond to p(x) = £ + a for a > 0. The 
pink regions are the cluster cores. 


For this choice of bandwidth, Lemma [l] implies that r/ = Op (log n/n 5+d ). From the low 
noise assumption, the above probability is bounded by fn V (\ogn / n) 13 ^ d+b \ ■ 


Remark: Parts 4 and 5 of the low noise assumption can be replaced by a single, slightly 
weaker assumption, namely, P(\p n {X) — £ n j| < e) < e 13 where = sup xgaCj p{x). The 
condition only need hold near the boundaries of the clusters. 


4.5 Gaussian Clusters 


Recently, Tan and Witten (2015) showed that a type of clustering known as convex clustering 
yields the correct clustering with high probability, even with increasing dimension, when 
the data are from a mixture of Gaussians. They assume that each Gaussian has covariance 
a 2 1 and that the means are separated by a factor of order \fd. Here we show a similar 
result for mode clustering. The clustering is based on a kernel estimator with a small but 
fixed bandwidth h > 0. 

Let X\,..., X n ~ a2 I) so that X t has density 


P( x ) = H 


7 Ti 


1=1 


a d (2ir) d / 2 


-\\X- H \\^/{2P) 
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Lemma 7 Let X ~ p and let e > 0. Suppose that 

' l/d \ d 


e < min 


7T ■ 




ircre 


16 


and that 


min||/ij — Hk\\ > 2crmax 
i¥=k j 




+ 2 log ( 1 ) - 2 log ^ 1 


7T, 


( 12 ) 


(13) 


Then 


P(p(X) < e) < e~ sd . 

Remark: Given the condition on e, we can re-write (|13|) as 


min 11 [ij — pk\ I > C'Vd 


for a constant C' > 0. 

Proof Let 


(_Li 

\ecr d (2ir) d / 2 ) 


c = min a 2 log , ., . /0 

j V \ea d ( 2 'K ) d / 2 , 


and let Bj = {x : \\x — tij\\/a < e}, j = 1,... ,k. The sets Bi,..., B^ are disjoint due to 


(13J)- 

First we claim that 


p(x) < e implies that x E B^j = P| B c s . 


To see this, let x E Bj for some j. Then, from the definition of Bj and c, 


„( T ) = V ___p IIMs11 2 /(2cr 2 ) > -||X— W || 2 /(2 (t 2 ) > 

^ a d ( 2^) d / 2 - a d ( 2 ir ) d / 2 


TT i 


a u, (2'K) d / 2 


rr d 


r c 2 / 2 > 


e. 


That is, x E -Bj for some j implies ji(x) > e and so the claim follows. 

Let Y E {1,..., &} where P(Y = j) = it j. We can write X = I(Y = j)Xj where 

N(p,j,a 2 I). Of course, X = Xj when Y = j. Note that \\Xj — pj\\ 2 /a 2 ~ X 2 j■ Hence, 


P{p{X) < e) < P (x G f| B c s j = £ *jP G f| B s 

=e e n B b ^ e e b j 

j s j 

Y^ 3 p (Uj ~^ 


Y = j 


> c 


E n i P Ud >C 2 ) =p (x 2 d > c 2 ) . 
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From Lauren and Massart (2000), (see also Lemma 11 of Obizinski et al) when t > 2d, 

P(xl>i)< exp (~(l-2\/f)) 

The last quantity is bounded above by e - */ 4 when t > 32 d. By the condition on e, c 2 > 32 d. 
HgI1C6 

P(p(X) < e) < P (x 2 > c 2 ) < e" c2/4 < e~ 8d . 


Theorem 8 Let X\,..., X n ~ o' 2 /). Let p^ be the kernel density estimator 

with fixed bandwidth h > 0 satisfying 


0 < h < - min 
2 j 


7n 


%/27T< 


7T(Te 


16 




Let D = (J ■ and define T = miri ? d(p,j, D). Suppose that p is Morse, 


r > aJ 32 d + 2 log 


and that 


min \ \fij — Hk\\ > 2crmax 
j 




- -2 log - . (14) 


Then, for all large n, 


— 8 d . „—nb 


P c{Xj,Xjf) c(Xj,Xk) for some j, / < e s “ + e 


Proof By Corollary 5, the cluster risk is bounded by P(p(X) < £ + Crj ) + e~ nb . With a 
fixed bandwidth not tending to 0, the bias dominates for all large n, and so r) < ch for some 
c > 0, except on a set of exponentially small probability. The condition on T implies that 


£ 


sup p(x) < min 

x£ D o 



d 


So e = ^ + Cp = f, + ch satisfies (12) 




Remark: The theorem implies the following. As long as the means are separated from 
each other and from the cluster boundaries by at least \fd, then a kernel estimator has 
cluster risk e~ 8d + e~ nb . It is not necessary to make the bandwidth tend to 0. 
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4.6 Low Dimensional Analysis 

In this section we assume that the dimension d is fixed. In this case, it is possible to use a 
different approach to bound the risk. We do not make the low noise assumption. The idea 
is to use results on the stability of dynamical systems (Chapter 17 of Hirsch, Smale and 
Devaney 2004). As before p is a Morse function and p is another function. Define r/.f, C g 
and (a) as in the previous sections. 

Let C be a cluster with mode m. Choose a number a such that 

0 < a < p(m) — Ar/Cg — £. (15) 

For any x in the interior of C, let 

f(x)=inf|f: n x (t) G (a) j. (16) 

If x G dC then t{x) = oo since n x (t) converges to a saddlepoint on the boundary. But for 
any interior point, t(x) < oo. For x € C^(a) we define t(x) = 0. 

Our first goal is to control the difference \\n x (t(x)) — 7 r x (t(x))||. And to do this, we first 
need to bound t(x). Let 

A(x) = inf |b( 7 r x (t))||. (17) 

0<t<t(x) 

Now, A(x) > 0 for each x £ dC. However, as x gets closer to the boundary, A(x) approaches 
0. We need an assumption about how fast A(x) approaches 0 as x approaches dC which is 
captured in the following assumption: 

(B) Let Bs = {x € C : d(x,dC) = 5}. There exists 7 > 0 such that, for all small 6 > 0, 

x G Bs implies that A(x) > c<5 7 . (18) 

Lemma 9 Assume condition B. If d(x,dC) > 6 then t(x) < p(m)/6 27 . 

Proof Let z = n x (t(x)) and x(s) = 7r x (s). Then, 

p{m) > p(z) - p(x) = J ds ’’ ds = j g(x(s)) T ir' x {s)ds 

|| 5 ((x(s))|| 2 (is > t(x) A 2 (x) = t(x)8 2ry . 



Now we need the following result which is Lemma 6 of Arias-Castro et al (2013) adapted 
from Section 17.5 of Hirsch, Smale and Devaney (2004) 

Lemma 10 Let r/i = sup x ||Vp(x) — Vp(x)||. For all t >0, 

\\*x(t)- 7T x (t)\\ < (19) 

d 

where K 2 = sup^ ||V 2 p(x)||. 
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We now have the following result. 


Theorem 11 Let 

s _ f K 2 y/dp(m) \ 27 

log(«2 V^/V^Ti) / 


( 20 ) 


Let x,y £ C. Suppose that d(x,dC) > 5 and d(y,dC) > 5. Also, suppose that rj\ < a 2 /C g . 
Then, for all small rj, dest(x) = dest(y). 


Proof We prove the theorem in the following steps: 

1. If x £ C^(a) and e < a/C g then B(x, e) C C^(a') where a! = a — eC g > 0. 

Proof: Let y £ B(x,e). Expanding p(y) around x, p{y) > p{x) — \\y—x\\C g > ^+a—eC g = 
f + a'. 

2 . Let t = t(x). If d(x, dC) > 5 then 7r x (t) £ C^(a') where a' = a — y/fj\C g > 0. 

Proof: By definition, 7r x (t) £ (a). From the previous Lemmas, 


TTa : (f) - 7T x (f)|| < 
< 


hi c K 2 Vd t 

H2 Vd, 

Hi c K.2Vdp(m)5- 2 ~i < yfrji. 
K2\[d 


The last inequality follows from the definition of <5. Hence, tt x (t) £ B(ir x (t), y/fji). It follows 
from part 1 that tt x (t) £ C'(a!). The fact that a' > 0 follows from the fact that rj i < a 2 /C g . 

3. From Theorem 3, p has a mode rh in C t( a') and has no other critical points in C t ( a'). 
So, letting z = %T x (t), and noting that z is on the path TT x (t), 


dest(x) = lim tt x (s) = lim n z (s) = m. 

s—yoo s—>• oo 


Applying steps 1, 2 and 3 to y we have that dest(y) also equals m. 


Now let ph be the kernel density estimator with h = h ri 
rji = Op(n -2 /( 6+rf )) so, with 5 defined as in (20), we have 


5 = 


(logn 


I - 1/(27) 


n V( 5 + d ). In this case 


By the previous theorem, there are no clustering errors for data points Xi such that 
d(Xi,D) >z d n where D = (J -dCj as long as r\\ = sup x ||Vp(x) — Vp(x)|| < a 2 /C g which 
holds except on a set of exponentially small probability (Lemma 1). Hence, we have: 


Corollary 12 Assume thatp is a Morse function with finitely many critical values. Denote 
the modes and clusters by mi ,... ,mk and C\,... ,Ck. Suppose that condition (B) holds in 
each cluster. Let p^ be the kernel density estimator. Let rj = max{r/o, ??i, 1 / 2 } where 

VO = sup|p/j(x) -p(x) |, Vl = sup I \Xph(x) -Xp{x) ||, Tj 2 = sup I \X 2 ph(x) - V 2 p(x)\\. 

XX X 

Let D = [Jj dCj and let 

X = {X i: d(Xi,D)>6 n } 
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where 


If K 


s _ ( K 2 Vdp(m) \ 27 
\\og(K 2 Vd/^/rn) J 

0 and nht + 4 -» oo, then 


(log n 


r 1/(27) 


c(Xi,Xj ) / c(Xi,Xj ) for any 


Xi, Xj 


€ X <e 


—nb 


( 21 ) 


for some b > 0. 

Thus, the clustering risk is exponentially small if we exclude points that are close to the 
boundary. 

5. Experiments 

An example of highly non-spherical mode clusters in two dimensions is given in Figure[7j left 
panel. The true density (contours shown in blue) has two modes, with the corresponding 
basins of attraction shown in blue and green. Mean shift (using a Gaussian kernel with 
bandwidth 1) is applied to the 1000 points sampled from the density as plotted, and all but 
the points shown in red are correctly clustered. All but 1% of points are correctly clustered, 
despite a total variation distance of about 0.29 between the true and estimated densities. 

Our theoretical results show that mean shift clustering should perform well even in high 
dimensions, assuming the bulk of the basins of attraction are well-separated by low density 
regions. We simulate such a setting in 10 dimensions, were we measure the performance 
of mean shift clustering on samples drawn from a mixture of two equal weight Gaussian 
components. The norm of the difference between the means is 5, and each component has 
randomly generated non-spherical covariance matrix with eigenvalues between 0.5 and 2. 
The center panel of Figure [7] shows the average clustering error as a function of the sample 
size n and bandwidth h. after 75 replications of the procedure. With only 50 samples, an 
average error of 0.05 is achieved with the appropriate bandwidth. 

The effect of component separation is demonstrated further in the right panel of Figure[7j 
Here, we draw n = 300 samples from an equal weight mixture of two unit covariance 
Gaussians in two dimensions, and measure the clustering error of mean shift (averaged over 
35 replications). 

6. Conclusion 

Density mode clustering — also called mean-shift clustering — is very popular in certain 
fields such as computer vision. In statistics and machine learning it is much less well known. 
This is too bad because it is a simple, nonparametric and very general clustering method. 
And as we have seen, it is not necessary to estimate the density well to get a small clustering 
risk. Because of this, mode clustering can be effective even in high dimensions. 

We have developed a bound on the pairwise risk of density mode clustering. The risk 
within the cluster cores — the high density regions — is very small with virtually no 
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Figure 7: Left: example of highly non-convex basins of attraction. Center: small sample 
complexity in high dimensions due to well-separated clusters. Right: effect of 
cluster separation, ranging from nearly unimodal to having two well-separated 
modes. 


assumptions. If the clusters are well-separated (low noise condition) then the overall risk is 
small, even in high dimensions. 

Several open questions remain such as: how to estimate the risk, how to choose a good 
bandwidth and what to do when the low noise condition fails. Regarding the last point, we 
believe it should be possible to identify regions where the low noise conditions fail. These 
are essentially parts of the cluster boundaries with non-trivial mass. In that case, there are 
two ways to reduce the risk. One is to merge poorly separated clusters. Another is to allow 
ambiguous points to be assigned to more than one cluster. For research in this direction, 
Li et al. ( 2007| ); Chen et al. (2014). 


see 
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